ABSTRACT

In  finding  discrete  solutions  of  systems  of  partial  differential 
equations  on  a  computer,  one  is  faced  with  the  problem  that  the  desired 
number  of  mesh  points  may  exceed  the  machine's  fast  memory.   This  problem 
will  be  common  on  machines  such  as  ILLIAC  IV,  for  which  extremely  high 
computing  power  invites  the  use  of  meshes  many  millions  of  words  in  size. 
Because  of  the  high  dollar  price  of  fast  memory,  it  is  sensible  to  look 
at  large  disk  stores  with  high  transmission  speeds  as  back-up  storage  for 
meshes  and  arrays.  The  main  problem  encountered  is  the  access  time  of 
such  a  storage  unit. 

The  address  of  a  block  of  data  stored  on  disk  might  be  taken  as 
the  address  of  the  first  word  in  the  block.   This  address  must  specify 
both  track  number  and  radial  position  of  the  block.   If  the  computer  issues 
a  command  to  transmit  the  block  immediately  after  the  beginning  of  the 
block  has  passed  the  reading  head,  then  the  system  must  wait  for  nearly 
a  complete  disk  rotation  before  the  transmission  is  started.   This  access 
time is, in general, not predictable; but it is bounded by the disk rotation
time.

The access time during which the computer does no useful work is
latency; and the object of the present investigation is to minimize the
latency for a reasonably large class of problems.
1.   INTRODUCTION

1.1  The  Problem 

A  reasonably  large  class  of  problems  is  the  class  of  two- 
dimensional  partial  differential  equation  (PDE)  problems.   Characteristic 
of  finite  difference  methods  for  solving  PDE's  are  the  stars  or  stencils 
of the methods, an example of which is shown below.

[Figure: nine-point stencil, d points deep.]

In order to compute new values of the variables (i.e., to update the variables) at the center
point  of  the  stencil,  one  needs  to  know  the  values  of  variables  at  several 
neighboring  points  in  the  horizontal  and  vertical  directions.   We  may  speak 
of  the  depth  d  of  the  stencil  as  the  distance  from  the  point  to  be  updated 
to  the  furthest  neighbor,  measured  in  points.  For  the  nine-point  stencil 
illustrated  above,  the  depth  d  is  two  points. 

When  one  wishes  to  update  an  entire  mesh  of  points,  the  stencil 
must  be  applied  to  each  of  the  points  simultaneously,  so  that  neighbors 
used  to  update  each  point  are  old  values  and  not  updated  values.   If  the 
entire  mesh  cannot  be  contained  in  the  fast  memory  of  the  computer,  one 
must  store  the  mesh  in  a  back-up  storage  device  such  as  magnetic  disk,  and 
read  in  only  a  part  of  the  mesh  at  a  time  for  updating. 
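
The requirement that every point be updated from old neighbor values can be sketched in a few lines of Python. This is only an illustration: the function name `update_mesh`, the weights, and the choice to leave points within d of the mesh boundary unchanged are assumptions made here, not part of any particular difference method.

```python
def update_mesh(old, d=2, w_center=0.2, w_arm=0.1):
    """Apply a star stencil of depth d to every interior point of `old`.

    With d=2 the stencil touches nine points: the center plus two
    neighbors in each of the four directions. The weights are
    hypothetical and chosen only so that they sum to one.
    """
    rows, cols = len(old), len(old[0])
    # Write into a copy so that every update reads old values only.
    new = [row[:] for row in old]
    for i in range(d, rows - d):
        for j in range(d, cols - d):
            total = w_center * old[i][j]
            for s in range(1, d + 1):
                total += w_arm * (old[i - s][j] + old[i + s][j]
                                  + old[i][j - s] + old[i][j + s])
            new[i][j] = total
    return new
```

Updating `old` in place instead of writing into `new` would mix updated and old neighbor values, which is exactly what the simultaneous application of the stencil is meant to avoid.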

In  this  investigation,  it  is  assumed  that  the  mesh  is  rectangular 
in shape. It can then be sliced into rectangular blocks; and the blocks
may be transmitted back and forth between the disk and the computer memory
for  updating  calculations.   The  object  of  this  investigation  is  to  formulate 
a  scheme  for  efficiently  swapping  these  blocks  of  mesh  between  disk  and 
fast  memory.   One  problem  is  that  to  update  one  block,  the  computer  must 
have  access  to  an  edge  of  each  of  the  neighboring  blocks.   The  depth  of 
this  interblock  communication  must  be  d  points,  the  depth  of  the  stencil 
which is applied to the mesh. Figures 1a and 1b illustrate the slicing
of  the  mesh  and  the  communication  necessary. 
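
As a rough sketch of the interblock communication, the following Python function computes which mesh points must be in fast memory to update block (i,j): the block itself plus a border d points deep, clipped at the mesh boundary. The function name and the conventions (block indices starting at 1, mesh point indices starting at 0, half-open ranges) are assumptions made for this illustration.

```python
def halo_bounds(i, j, block_rows, block_cols, mesh_rows, mesh_cols, d):
    """Return (row_lo, row_hi, col_lo, col_hi) for block (i, j) plus its
    d-deep border of neighbor edges. Ranges are half-open, and the border
    is cut off where the block lies on a mesh boundary."""
    r0 = (i - 1) * block_rows          # first mesh row of the block
    c0 = (j - 1) * block_cols          # first mesh column of the block
    return (max(r0 - d, 0), min(r0 + block_rows + d, mesh_rows),
            max(c0 - d, 0), min(c0 + block_cols + d, mesh_cols))
```

For a 12 x 12 mesh cut into 4 x 4 blocks with d = 2, the interior block (2,2) needs rows and columns 2 through 9, while the corner block (1,1) needs only rows and columns 0 through 5.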

The  program  or  subroutine  which  performs  the  updating  calculations 
on  a  block  and  its  neighbor  edges  can  be  the  same  for  each  block  of  mesh, 
with  some  conditional  branches  to  handle  blocks  on  mesh  boundaries.   This 
subroutine  can  be  called  the  kernel,  to  distinguish  it  from  the  supervisory 
program  which  handles  input/output  and  other  chores.   The  main  constituent 
of  the  kernel  will  be  the  stencil  calculations  of  the  method  used. 

There  are  two  ways  in  which  a  kernel  is  likely  to  sweep  or  pass 
over  the  mesh:  sequentially  by  rows  or  by  columns.  In  sweeping  by  rows, 
blocks will be input (and output) in the order 11, 12, ..., 1n, 21, 22,
..., 2n, 31, ..., mn. In this case, it is clear that when updating
block (i,j), nothing special need be done to input edges from blocks (i,j-1)
and (i,j+1). We simply save in fast memory the rightmost edge, d points
deep, of block (i,j-1), which will have just been updated and output to
disk; and we delay calculations on (i,j) long enough to allow input of
block (i,j+1). The more difficult problem is arranging to have access
to  the  lower  edges  of  blocks  in  row  i+1  and  the  upper  edges  of  blocks  in 
row i-1. The scheme presented in this report uses multiple storage of
upper and lower edge points on disk. Upper and lower edge points are
grouped by rows into an "edge area" on disk, and also appear in the block
storage or "mesh area". A similar arrangement is made for sweeping the
mesh by columns; the essential difference between the two possible storage
schemes is the grouping of edge values in the edge area. The transposition
from row storage to column storage of edges and vice versa may be done
during a sweep, so that successive kernels in a program may switch freely from
row-sweeping to column-sweeping, if this is ever required.

Figure 1a. A general rectangular mesh sliced into blocks.

Figure 1b. Edge values needed for kernel calculations on block (i,j).

1.2  The  Machine 

We  will  consider  implementing  an  I/O  scheme  on  one  quadrant  of 
ILLIAC IV with 2048 quadrant words, or 2048 x 64 words, of fast memory [1].
Coupled to the machine is DISK IV, which has a storage capacity of approximately
15 million words. DISK IV is organized into 48 tracks of 1200
segments  each.   Disregarding  parity  bits,  each  segment  is  equivalent  to 
one  long  word  across  a  four-quadrant  ILLIAC  IV,  or  four  quadrant  words. 
The  smallest  I/O  transaction  allowed  is  one  segment.   In  the  scheme  to  be 
proposed,  a  logical  track  will  be  divided  into  a  number  of  blocks  of  b 
segments  each.   Each  mesh  block  will  be  stored  in  some  block  of  disk 
segments. 
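
The storage figures quoted above can be checked with a little arithmetic. The sketch below assumes 48 disk tracks and 64 machine words per quadrant word; both numbers are read from the description of the machine and should be treated as assumptions of this illustration, as are the function names.

```python
TRACKS = 48                       # assumed number of DISK IV tracks
SEGMENTS_PER_TRACK = 1200
QUADRANT_WORDS_PER_SEGMENT = 4    # one long word across four quadrants
WORDS_PER_QUADRANT_WORD = 64      # assumed: one word per processing element


def disk_capacity_words():
    """Total disk capacity in machine words, ignoring parity bits."""
    return (TRACKS * SEGMENTS_PER_TRACK
            * QUADRANT_WORDS_PER_SEGMENT * WORDS_PER_QUADRANT_WORD)


def fast_memory_words(quadrant_words=2048):
    """Fast memory of one quadrant in machine words."""
    return quadrant_words * WORDS_PER_QUADRANT_WORD


def words_per_block(b):
    """Machine words held by one disk block of b segments."""
    return b * QUADRANT_WORDS_PER_SEGMENT * WORDS_PER_QUADRANT_WORD
```

Under these assumptions the disk holds 14,745,600 words, which agrees with the figure of approximately 15 million, against 131,072 words of fast memory in one quadrant.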

For  a  slight  increase  in  generality,  we  will  allow  stringing 
several  disk  tracks  together  to  form  a  logical  track  of  1200 t  segments,  t 
an  integer.   We  will  have  to  take  account  of  the  fact  that  the  read/write 
head  cannot  switch  instantaneously  from  one  track  to  another.   DISK  IV 
actually has a R/W head for every track, and switching is done electronically.
However, it still takes a measurable amount of time to change tracks. In
fact, the switch can be performed within the space of disk revolution
through two segments; thus two empty segments must appear at the end (or
beginning) of each block of b segments and each 1200-segment disk track.
The disk may be pictured as in Figure 2. Points A and A' are
the same radial position, so that segments may be numbered modulo 1200t.
The read/write head is imagined to move constantly from left to right, and
can change tracks within the space of two segments.

Figure 2. Schematic of disk (48/t logical tracks of 1200t segments each, running from A to A').
2.   MAPPING THE MESH ONTO DISK

The  blocks  (i,j)  of  Figure  la  are  skewed  on  disk  to  allow  row  or 
column sweeping. For row sweeping, upper and lower edge values from each
row  of  blocks  are  duplicated  into  a  number  of  logical  tracks  set  aside  for 
edge  storage.   These  edge  values  are  grouped  as  shown  in  Figure  3.   Upper 
or  lower  edges  from  k  blocks  of  mesh  form  one  group  of  edges.   The  groups 
are  numbered  with  Roman  numerals  to  indicate  row,  superscripts  U  and  L  for 
upper  and  lower,  and  subscripts  to  distinguish  groups  in  a  particular  row. 
For row R, group $R_1^L$ contains edges from blocks (R,1) to (R,k-1), $R_2^L$ contains
edges from blocks (R,k) to (R,2k-1), etc. Grouping for both $R_i^L$ and $R_i^U$ is
summarized in the following table:

Group             Edges from blocks        Group             Edges from blocks
                  in columns                                 in columns

$I_1^L$           1 to k-1                 $I_1^U$           1 to k-2
$I_2^L$           k to 2k-1                $I_2^U$           k-1 to 2k-2
...               ...                      ...               ...
$I_{\ell_n}^L$    $(\ell_n-1)k$ to n       $I_{\ell_n}^U$    $(\ell_n-1)k-1$ to n


The parameter $\ell_n$ is the smallest integer such that

    $n \le \ell_n k - 2$.

The overhanging outlines of edge groups at the mesh boundaries merely
indicate that there are empty spaces in those edge groups. For sweeping
the mesh by columns, left and right edges of blocks are grouped analogously.
A smallest $\ell_m$ is chosen such that

    $m \le \ell_m k - 2$.

Note that k will not be subscripted. In problems requiring transposition,
k will have the same value in both directions. All of the parameters used
in constructing a scheme will be constant throughout problem execution.

For  the  purpose  of  describing  the  map  of  mesh  onto  disk,  we  will 
further  simplify  the  schematic  of  disk  by  assuming  the  logical  track  to 
have  a  length  of  kT-1  (T  defined  below)  blocks  of  b  segments  each.   We 
will  store  each  group  of  edges  and  each  block  of  mesh  in  some  block  on 
disk as shown in Figure 4 (a fold-out chart). Figure 4 is a snapshot of
a  radial  section  of  the  disk.   It  is  intended  to  appear  as  the  left  end 
of  Figure  2.   The  map  is  drawn  for  "period"  T=5  blocks.  T  measured  in 
units  of  time  will  be  the  kernel  calculation  time  for  one  block  of  mesh. 
T  measured  in  blocks  is  the  ratio  of  kernel  calculation  time  for  a  block 
to  the  input  transmission  time  for  a  block.   In  the  analysis  below,  T  will 
be  measured  in  blocks. 

Figure 4 shows edges grouped as in Figure 3; i.e., for row-sweeping.
For column sweeping, edges would be regrouped and re-allocated;
mesh blocks would retain the storage allocation shown. The reason mesh
blocks need not be re-allocated is that skewing makes allocation look the
same for both rows and columns if one considers radial positions only
(horizontally across page) and disregards track number (vertically down
page). Block (i,j) is in the same radial position as block (j,i). Let
p(X) be the radial position (disk block number measured from an arbitrary
radial point A) of record X, where X is either an edge group or a mesh
block. The positions of all records are defined in terms of $p(I_1^U)$.


    $p((R+1)_1^{U/L}) = p(R_1^{U/L}) + T$

    $p(R_1^L) = p(R_1^U) + T$
    $p(R_{i+1}^{U/L}) = p(R_i^{U/L}) + 1$

    $p((1,1)) = p(II_1^L) + 2$

    $p((i,j+1)) = p((i+1,j)) = p((i,j)) + T$

Besides these, we add the requirement

    $p((i,j+k)) = p((i,j)) + 1 \quad (= p((i+k,j)))$,

which relates the grouping constant k to the length of the logical track.
It defines the logical track to be of length (kT-1) blocks.
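
The position relations can be collected into a small Python sketch. It takes $p(I_1^U) = 0$ as the reference, so that $p((1,1)) = 2T + 2$, and reduces every position modulo the logical track length kT-1. The function names and the sample values k = 4, T = 5 are assumptions chosen for illustration; they are not the parameters of Figure 4.

```python
def track_length(k, T):
    """Length of the logical track in disk blocks."""
    return k * T - 1


def p_block(i, j, k, T):
    """Radial position of mesh block (i, j), taking p(I_1^U) = 0.

    p((1,1)) = p(II_1^L) + 2 = 2T + 2, and each step of one row or one
    column adds T.
    """
    return (2 * T + 2 + (i - 1) * T + (j - 1) * T) % track_length(k, T)


def p_edge(row, sub, k, T, lower=False):
    """Radial position of the upper (or lower) edge group of `row` with
    subscript `sub`. Each row adds T, each subscript adds 1, and a lower
    group lies T blocks after the upper group of the same row."""
    pos = (row - 1) * T + (sub - 1) + (T if lower else 0)
    return pos % track_length(k, T)
```

With k = 4 and T = 5 the logical track is 19 blocks long. Moving k columns to the right advances the position by kT blocks, which is 1 modulo 19, and blocks (i,j) and (j,i) share a radial position because the skew adds T per row and per column alike.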

Implementation of the simplified structure of disk used in
Figure 4 on an actual disk with 1200-segment tracks will be taken up
in  a  later  section.   For  now  we  will  assume  zero  head-switching  time,  and 
proceed  to  describe  the  sequence  of  I/O  transactions  for  sweeping  the 
mesh. 
3.   SWEEPING A MESH

3.1  Normal  Mode 

If we are to sweep the mesh by rows, we wish to read into fast
memory the blocks (1,1), (1,2), (1,3), ..., (1,n), (2,1), (2,2), ... in
sequence. Ignoring edge value transmissions for the time being, we read
block (1,1) from logical track i in Figure 4 as the read/write head moves
from left to right. Let calculations on block (1,1) begin as the R/W head
passes the point p((1,2))-1 (point p(X) is the beginning of position p(X)).
While calculations proceed on block (1,1), block (1,2) is read. Since we
intend to read each block of row 1 of mesh as it passes by, the kernel
must finish the calculations on each block within the period T. We will
write the updated block (1,1) in position p((1,3))-1 (on another logical
track if necessary). At the point p((1,3))-1 we begin calculations on
block (1,2). Immediately after writing the new block (1,1) we read block
(1,3). In general, we read block (i,j), write updated block (i,j-1),
read (i,j+1), write (i,j), etc. in sequence. There will, of course, be
a "hiccup" at the end of each mesh row, except in one case to be discussed
later. If necessary, we spend a whole or part of a disk rotation of latency
to insure that the finishing of one row does not interfere with the beginning
of the next row.

After sweeping the whole mesh in this way, the updated mesh will
be stored on disk in the same configuration, but shifted 2T-1 blocks to
the right. Column-sweeping (i.e., (1,1), (2,1), (3,1), ..., (m,1), (1,2),
(2,2), ...) results in the same shift. It is evident that we may sweep
by rows or columns any number of times in any order.
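
The interleaving of reads and writes along one row can be sketched as follows. The function `row_schedule` is a hypothetical helper: it measures positions in disk blocks from the first block of the row, ignores wrap-around on the logical track, and leaves out edge group transmissions.

```python
def row_schedule(n, T, p_first=0):
    """Return the sorted list of (position, action, column) events for
    sweeping one row of n blocks with period T.

    Block j is read at p_first + (j-1)T and its updated version is
    written 2T-1 blocks later, i.e. one block before block j+2 is read.
    """
    events = []
    for j in range(1, n + 1):
        read_at = p_first + (j - 1) * T
        events.append((read_at, "read", j))
        events.append((read_at + 2 * T - 1, "write", j))
    return sorted(events)
```

For n = 4 and T = 5 the head reads block 1 at position 0, reads block 2 at 5, writes block 1 at 9, reads block 3 at 10, writes block 2 at 14, and so on; no two transmissions fall in the same disk block, and every written block lands 2T-1 positions to the right of where it was read.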
We now superimpose edge group transmissions upon the I/O sequence
just described. The term "reading normal" will be applied to sweeping the
mesh by rows with edge values grouped for row-sweeping (as in Figures 3
and 4) and to sweeping by columns with edge values grouped for column-sweeping.
The other two possibilities are column-sweeping with row-grouped
edges and row-sweeping with column-grouped edges; the term "reading transposed"
is applied to these types of sweeps.

In  reading  normal  by  rows,  edge  values  to  the  left  and  right  of 
a  block  (as  viewed  in  Figure  3)  are  input  automatically  because  of  the 
sequence  of  block  reading,  as  mentioned  before.   Similarly  for  reading 
normal by columns, edge values below and above a block are input automatically.
Edge values below and above for row-normal reading and left and
right  for  column-normal  reading  will  be  input  from  the  edge  area  on  disk. 
The  procedure  for  row-normal  reading  will  be  shown.   Column-normal  reading 
is  analogous:   edge  group  superscript  U  (upper)  is  replaced  by  R  for  "right" 
and  L  (lower)  is  replaced  by  L  for  "left". 

Row-normal reading is shown in Figure 5. This chart is not a
storage map as in Figure 4. Figure 5 is read a line at a time from left
to right. Each numbered line represents one logical disk revolution.
Each entry on a line represents the transmission of a record to or from
the disk. Entries in rectangles are written on disk; other entries are
read from disk. Horizontal position corresponds to radial position on
a logical track exactly as in Figure 4. Each read entry is, of course,
in the position indicated by Figure 4. It may be helpful to align Figure
5 on Figure 4.
Only the I/O sequence for one row of mesh blocks need be described,
since the relationship between positions of edge groups required for that
row and positions of blocks in that row is independent of row number.
We will follow through the sequence for row 2 since row 1 is a slightly
degenerate case.

On revolution 1 we must initialize the calculation by reading
the first upper edge group of row 1, $I_1^U$, and the first lower edge group
of row 3, $III_1^L$. With this data in fast memory we may read and update the
first k-2 blocks of row 2, since $I_1^U$ contains edges from the first k-2 blocks.
We, therefore, start reading the blocks sequentially from (2,1) on revolution 1
and write each block, updated, 2T-1 disk blocks after it is read.
This eventually brings us back to radial point A where revolution 2 begins.
We read edge group $I_2^U$, which contains edges below blocks (2,k-1) to
(2,2k-2), as it goes by. Notice that we have read the edge below (2,k-1)
well before we actually need it. Later on in revolution 2 we read $III_2^L$,
which contains edges above blocks (2,k) to (2,2k-1). The edge above block
(2,k) is read in at p((2,k)) + T-2; but this is acceptable since calculations
on (2,k) are not started until the point p((2,k)) + T-1.

Besides writing updated blocks as we move along the disk, we
must write updated edge groups. These groups must be written in positions
such that the disk configuration of the updated mesh and edge groups is
the same as Figure 4. Hence, we must write the updated edge group $II_1^{U'}$
2T+2 disk blocks before the position of the updated block (2,1), and
$II_1^{L'}$ T+2 disk blocks before updated block (2,1). Fortunately, so to
speak, there are no other activities which must occupy the R/W head in
these positions on revolution 2*; and the fact that $II_1^{U'}$ and $II_1^{L'}$ must
contain updated edges from blocks (2,k-2) and (2,k-1) respectively is no
problem, since those blocks will have been updated by the time the corresponding
edge groups need be written. There must, of course, be buffers
in fast memory for accumulating updated edge groups prior to transmissions
to disk.

On  revolution  3  the  same  procedure  is  followed  with  all  trans- 
missions shifted  to  the  right  by  one  block,  all  subscripts  incremented  by 
1  and  all  j-values  of  blocks  (i,j)  incremented  by  k.   With  each  revolution 
transmissions  shift  right  one  disk  block  by  the  fact  that  we  have  arranged 
the  logical  track  to  be  kT-1  blocks  long  and  mesh  block  transmissions  have 
period  T. 

Earlier we defined $\ell_n$ such that

    $(\ell_n-1)k-2 < n \le \ell_n k-2$.

Let us say that $n = \ell_n k-2$. Then the last transmissions for row 2 of mesh
appear as shown on revolution $\ell_n+1$ in Figure 5. If $n < \ell_n k-2$, then some
mesh block transmissions disappear, but updated edge groups $II_{\ell_n}^{U'}$ and
$II_{\ell_n}^{L'}$ must still be written in the positions shown. Processing of row 3
on some revolution $r_3$ may begin after row 2 is finished. We may have
$r_3 = \ell_n + 2$; however, in this case we spend more than a revolution of
latency ($kT-1 + p((3,1)) - (p((2,\ell_n k-2)) + T)$ disk blocks). We would
like to have $r_3 = \ell_n + 1$. This cannot be done with the sequence shown
in Figure 5 because of conflicts in $p(II_1^U)$ and p((3,1)). The positions
of activity on revolution $\ell_n+1$ are for $\ell_n = T-1$ with T = 5. If $\ell_n$ were
5, line $\ell_n+1$ would be shifted one block right, and we could have $r_3 = \ell_n+1$.
It is clear that the value of $\ell_n$ is critical in determining latency.

* We do not attempt to take advantage of the machine's capability to perform
two transmissions simultaneously. This capability arises from having two
electronics units, each with half of the total number of disk storage
units. As a result, any simultaneous transmissions must occur in opposite
halves. We do not attempt to meet this restriction.

Acceptable values of $\ell_n$, i.e., those for which $r_3$ may equal
$\ell_n + 1$, can be found in general by comparing activity in revolution $r_3$
to activity in revolution $\ell_n + 1$. Take $p(II_1^U) = 0$. Then revolution $r_3$
has activity in positions 0, 3T, 3T+2, 4T+2, 5T+1, 5T+2, .... We also
find positions of activity for revolution $\ell_n + 1$ by first determining
$p((2,\ell_n k-2))$:

    $p((2,1)) = 2T + 2$
    $p((2,k)) = p((2,1)) + 1 - T$
    $p((2,k-2)) = p((2,k)) - 2T$
    $p((2,\ell_n k-2)) = p((2,k-2)) + (\ell_n - 1)$.

Therefore, $p((2,\ell_n k-2)) = 2 + \ell_n - T$. Other activity in revolution
$\ell_n + 1$ can be determined from this fix. Positions of activity are
summarized in Table 1.


          ENDING ROW 2                          STARTING ROW 3

ACTIVITY              POSITION              ACTIVITY      POSITION

$(2,\ell_n k-4)$      $2+\ell_n-3T$         $II_1^U$      0
$(2,\ell_n k-5)$      $1+\ell_n-2T$         $IV_1^L$      3T
$(2,\ell_n k-3)$      $2+\ell_n-2T$         (3,1)         3T+2
$(2,\ell_n k-4)$      $1+\ell_n-T$          (3,2)         4T+2
$(2,\ell_n k-2)$      $2+\ell_n-T$          (3,1)         5T+1
$(2,\ell_n k-3)$      $1+\ell_n$            (3,3)         5T+2
$(2,\ell_n k-2)$      $1+\ell_n+T$
$II_{\ell_n}^{U'}$    $-2+\ell_n+2T$
$II_{\ell_n}^{L'}$    $-2+\ell_n+3T$

Table 1. Positions of activity for row transition
in row-normal reading.


We will not take into account any activity before $p((2,\ell_n k-4))$
in ending row 2 or after p((3,3)) in beginning row 3. This will lead to
an analysis which is correct for $2 + \ell_n - 3T < 0$ and $-2 + \ell_n + 3T < 5T + 2$;
or $\ell_n < 3T - 2$ and $\ell_n < 2T + 4$. It has not yet been mentioned that the
scheme proposed will work only for $T \ge 5$. We look for acceptable values
of $\ell_n \le 10$, which satisfies both inequalities above for $T \ge 5$. A value
is acceptable if the two sets of positions in Table 1 are disjoint. The
sets are disjoint for $\ell_n = T$, for example. $\ell_n = T-3$ is unacceptable for
T = 5, 7, acceptable otherwise ($T \ge 5$). $\ell_n = T-2$ is always unacceptable.
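
The acceptability test lends itself to a mechanical check. The sketch below encodes the two columns of Table 1 as sets of positions, with $p(II_1^U) = 0$, and calls a value of $\ell_n$ acceptable when the sets are disjoint. The function names are introduced here for illustration, and the position expressions are those of Table 1 as read in this section.

```python
def ending_row_positions(l, T):
    """Positions of activity while ending row 2 (left column of Table 1)."""
    return {2 + l - 3 * T, 1 + l - 2 * T, 2 + l - 2 * T,
            1 + l - T, 2 + l - T, 1 + l, 1 + l + T,
            -2 + l + 2 * T, -2 + l + 3 * T}


def starting_row_positions(T):
    """Positions of activity while starting row 3 (right column of Table 1)."""
    return {0, 3 * T, 3 * T + 2, 4 * T + 2, 5 * T + 1, 5 * T + 2}


def acceptable(l, T):
    """True if rows 2 and 3 can share a revolution, i.e. no position is
    needed by both the end of row 2 and the start of row 3."""
    return ending_row_positions(l, T).isdisjoint(starting_row_positions(T))
```

Evaluating `acceptable(l, T)` for l from 1 to 10 and T from 5 to 13 gives a quick way to cross-check the hand analysis; for instance `acceptable(4, 5)` is false because positions 0 and 17 are claimed by both rows, while `acceptable(5, 5)` is true.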

In general, a value can be tested for acceptability by substituting
that value into the left column and comparing each value obtained with all
of the elements in the right column, for all values of T. Whether one substitutes
numbers or expressions in T for $\ell_n$, it is a painful task. Table
2 shows acceptable numerical values of $\ell_n$ versus T. For any numerical
value of $\ell_n$, a minimum $K_{\ell_n}$ can be found such that for $T \ge K_{\ell_n}$ the $\ell_n$ is
either always acceptable or always unacceptable. $K_{10} = 13$: $\ell_n = 10$ is
acceptable for $T \ge 13$. $K_i \le K_{10}$, i = 1, ..., 9. The table is therefore
not shown for T > 13. Table 2, along with a corresponding table for transposed
reading, will be useful in finding a scheme with minimum latency
for a given problem.
                              T
$\ell$        5    6    7    8    9   10   11   12   13

    1         A    A    A    A    A    A    A    A    A
    2
    3              A    A    A    A    A    A    A    A
    4
    5         A              A    A    A    A    A    A
    6         A    A              A    A    A    A    A
    7              A    A              A    A    A    A
    8                   A    A              A    A    A
    9              A         A    A              A    A
   10       (A)         A         A    A              A

Table 2. Acceptable values of $\ell$ for normal reading.
"A" indicates acceptable; blank indicates
unacceptable. $\ell = \ell_n$ or $\ell_m$. Parentheses
indicate value which is unacceptable for
transposed reading.
Only row-normal reading has been considered in this section; but
it is clear that all of the results apply also to column-normal reading
with column parameters replacing corresponding row parameters, namely, $\ell_m$,
m, and edge group superscripts R and L. As mentioned before, k is the
same in both directions.

We now examine the problem of changing over from sweeping by rows
to sweeping by columns. Note in Figure 5 that when updated edge groups
$II_i^{U'}$ and $II_i^{L'}$ were written on disk, the upper and lower edges of groups
of k blocks in a row were written. This prepares the disk storage structure
for a subsequent sweep by rows. If the next sweep is to be by columns,
however, we store on disk the updated right and left edges of the blocks
in the row rather than the upper and lower edges respectively. We will
use the notation $II_i^{R'T}$ and $II_i^{L'T}$. $II_i^{R'T}$ is written in place of $II_i^{U'}$,
and $II_i^{L'T}$ is written in place of $II_i^{L'}$. The problem arises that these
right and left edges are still grouped by rows; i.e., $II_i^{R'T}$ contains
updated right edges from blocks (2,(i-1)k-1) to (2,ik-2), and $II_i^{L'T}$
contains updated left edges from blocks (2,(i-1)k) to (2,ik-1).* Edges
in all of the $R_i^{R'T}$ and $R_i^{L'T}$ are grouped in the same way. Since these
edges are not grouped for column-normal reading, a different I/O sequence
will be required. This new sequence will be called column-transposed
reading. Instead of reading k edges from one place in a logical disk
revolution, we will read the edges one at a time from many points in a
revolution, so that the flow of edge values into fast memory becomes a
quasi-continuous process as opposed to the single transmissions in Figure 5.

* It may be helpful to imagine the individual edges (k per edge group) in
Figure 3 to be rotated clockwise by ninety degrees.

3.2  Transposed  Mode 

Figure 6, a and b, illustrates the procedure for column-transposed
reading. The mesh is assumed to be mapped on disk as in Figure 4, except
that upper edge groups $R_i^U$ have been replaced with right transposed edge
groups $R_i^{RT}$ and lower edge groups $R_i^L$ have been replaced with left transposed
edge groups $R_i^{LT}$.

A sweep on column 2 will be used as a first example. The first
block to be input is (1,2), for which we need the right edge of (1,1) and
the left edge of (1,3). The right edge of (1,1) is contained in edge group
$I_1^{RT}$; specifically, it is the third edge in $I_1^{RT}$, counting the two empty
edge spaces overlapping the mesh in Figure 3. Likewise, the left edge of
(1,3) is the fourth edge in $I_1^{LT}$. The edges needed for block (2,2) are
the third and fourth edges in edge groups $II_1^{RT}$ and $II_1^{LT}$ respectively.
In order to move up column 2, we must have one edge each from all edge
groups with subscript one. Figure 6a shows "simultaneous" transmissions
of single edges from pairs of edge groups. An edge from $II_1^{RT}$ and an edge
from $I_1^{LT}$, for example, are to be read within the space of one disk block.
The reading head will be required to switch tracks within a disk block,
and the edges within the groups involved must be arranged such that no
pair of edges required in any transmission have the same radial position.
This matter will be taken up in a later section; for now we assume the
requirement to be satisfied.
As each successive column is swept, the beginning of the mesh
block  transmissions  moves  T  disk  blocks  to  the  right  because  of  skewing. 
The  edge  value  transmissions,  however,  start  at  the  same  radial  position 
up  to  and  including  the  sweep  on  column  k-2.   We  are  reading  edges  farther 
and  farther  ahead  of  corresponding  mesh  blocks.   When  mesh  blocks  have 
finally shifted around an entire logical track, we will be unnecessarily
reading  edges  a  disk  revolution  ahead  of  the  mesh  blocks  for  which  the 
edges  are  needed.  At  this  point  we  may  shift  to  reading  edges  immediately 
before  corresponding  blocks.   Such  a  shift  is  executed  in  two  stages  in 
Figure  6b. 

The  first  stage  of  the  shift  is  executed  at  column  k-1.   Only 
the  transmissions  of  left  edges  from  mesh  blocks  in  column  k  are  shifted. 
Transmissions  of  right  edges  of  column  k-2  are  still  started  a  revolution 
ahead  of  mesh  block  transmissions.   Column  k-1  is  a  special  case,  as  its 
eastern  neighbor  edges  are  now  in  edge  groups  with  subscript  2  while  its 
western  neighbor  edges  are  still  in  edge  groups  with  subscript  1.   It  would 
be  possible  to  shift  transmissions  of  both  left  and  right  edges  at  column 
k-1;  however,  shifting  the  left  edges  only  results  in  a  greater  freedom 
with the parameter $\ell_m$, when one considers the efficiency of the transition
from  column  k-2  to  column  k-1. 

The  second  stage  of  the  shift  is  executed  at  column  k.   Now  all 
edges  needed  are  in  edge  groups  with  subscript  2.   The  sweep  up  column  k+1 
is  similar  to  the  sweep  up  column  1,  except  that  column  1  has  no  western 
neighbor  edges.   Likewise,  the  sweep  up  column  k+2  is  similar  to  the  sweep 
up  column  2;  and  in  general  the  sweep  up  column  j  is  similar  to  the  sweep 
up column j modulo k. Columns nk-1, n = 1, 2, ..., $\ell_n - 1$, all require edges
from edge groups with different subscripts.

When the updated edge values are written, they will, of course,
be grouped in the column direction. If left and right edge groups $\overline{R}_i^{L'}$
and $\overline{R}_i^{R'}$, $\overline{R} = \overline{I}, \overline{II}, \ldots, \overline{n}$, are written, then the updated mesh will be
organized for column-normal reading. If the subsequent sweep were to be
by rows, lower-transposed $\overline{R}_i^{L'T}$ and upper-transposed $\overline{R}_i^{U'T}$ would be written
instead. If R indicates a sweep by rows and C a sweep by columns, then
sequences of the type (RC) will have every read in the transposed mode.
Sequences of the type (RRCC) will have alternating normal and transposed
reading sequences.

Merging the transmissions at the end of one column with the
transmissions at the beginning of the next column is slightly more complex
than for normal reading, in that edge transmissions at the beginning of a
column may penetrate very deeply into the transmissions for the preceding
column. The transitions from column k-2 to column k-1 and from column k-1
to column k are worst cases. Both of these transitions will be inspected.

Figure 7 illustrates the ending of column k-2 and the beginning
of column k-1. Let the reference position $p((\mathrm{I}_1^{RT}))$ be zero.
For starting column k-1, there is activity in positions 0, T, 2T, ...,
$p((\overline{k+1}_1^{RT})) = kT \bmod (kT-1) = 1$, p((1,k-1)) = 3, 1+T, 3+T,
1+2T, 2+2T, 3+2T = p((3,k-1)). Positions of activity for ending column k-2
may be computed by first obtaining a fix on block $((\ell_m-1)k+1,\,k-2)$:

$$ p((1,k-2)) = p((1,k-1)) - T = 3 - T $$

$$ p(((\ell_m-1)k+1,\,k-2)) = p((1,k-2)) + (\ell_m-1) = 2 + \ell_m - T. $$

We must compare revolutions A with C and B with D. Note that for the first
comparison, we need only compare activity between markers M and 11 on A
with the first activity on C because of the periodicity T of activities


and  the  facts  that  p(|  {£   -l)k,k-2  I)  =  1  +  I    >   p((ln   ))  and  p((i  k-2n   ))  = 

m  /  '  m      1  m   1  '' 


£   -2T  <  p((ln   ))  for  i  <  2T.   We  will  again  look  for  values  £     <  10. 
m    -     1   '      m—  D  m  — 


For the comparison of B with D, note that
$p((\ell_m k-2,\,k-2)) = 2+\ell_m-2T < p((1,k-1)) = 3$ for $\ell_m \le 2T$;
so that we need only take account of possible conflicts between the two
updated edge group transmissions on B and the mesh block transmissions on
D. Transmissions to the left on B and D will be compatible if transmissions
to the right on A and C are compatible. Also the two updated edge group
transmissions on B will not conflict with edge transmissions on D if
updated edge group transmissions on A do not conflict with
$(\mathrm{I}_1^{RT})$ for $\ell_m \le 2T$. Of the mesh block transmissions
on D we need only consider (1,k-1) and (2,k-1) since the next one, (3,k-1),
is always to the right of the last activity of B for $\ell_m \le 2T$.
Actually, we need only consider the last transmission on B.

Table 3 lists the activities of interest in making the column
transition. Sets A and C must be disjoint and sets B and D must be disjoint.
If we construct a table of acceptable values such as Table 2, we find that
there is only one unacceptable value pair ($\ell_m$, T) which is not on
Table 2. That pair is $\ell_m$ = 10, T = 5. If we delete the corresponding
"A" from Table 2, we have a table of acceptable values for both normal and
transposed reading.

          ENDING COLUMN k-2                              STARTING COLUMN k-1

     ACTIVITY                      POSITION              ACTIVITY       POSITION

 A   [illegible]                   $\ell_m - 2T$      C  [illegible]    0
     $((\ell_m-1)k-2,\,k-2)$       $1+\ell_m-2T$
     $((\ell_m-1)k,\,k-2)$         $2+\ell_m-2T$

 B   edge group, subscript k-2     $-2+\ell_m-T$      D  (1,k-1)        3
     $((\ell_m-1)k-1,\,k-2)$       $1+\ell_m-T$          (2,k-1)        3+T
     $((\ell_m-1)k+1,\,k-2)$       $2+\ell_m-T$
     edge group, subscript k-2     $-2+\ell_m$
     edge group, subscript k-2     $-1+\ell_m$

Table 3.  Positions of activity for column transition
k-2 to k-1 in column-transposed reading.

One might question the statement that we have obtained all
unacceptable pairs ($\ell$, T) for transposed reading since we have
considered only one transition out of k different transitions. In fact, it
can be verified that we have all unacceptable values for
$\ell_m, \ell_n \le 10$ and $T \ge 5$.

Having  investigated  most  of  the  logical  aspects  of  the  scheme 
under  consideration,  we  now  consider  the  problem  of  implementing  the  scheme 
on  a  disk  storage  unit  with  1200  addressable  segments  per  track.

4.   IMPLEMENTING  THE  SCHEME

4.1  General 

For  mesh  storage  in  one  quadrant  of  ILLIAC  IV,  it  is  efficient
to  store  8x8  squares  of  mesh  points  in  "quadrant  words"  across  the  64
processing  elements.   For  this  reason,  it  will  be  assumed  that  the  mesh
is  subdivided  into  8x8  squares;  and  mesh  blocks  will  have  dimensions  of
p  X  q  squares,  p  and  q  integers.   The  smallest  addressable  piece  of  data
on  the  disk  is  the  segment,  which  consists  of  256  words,  or  4  quadrant
words.   The  head-switching  time  of  the  disk,  again,  is  taken  as  two  segments.

We construct each logical track pictured in Figure 4 from t disk
tracks for some integer t. In doing so, we do not take advantage of the
t-1 times an actual radial position is passed within each logical revolution.
If b is the number of segments in a disk block, then

$$ b(kT-1) \le 1200t $$

must be satisfied. We might try

$$ b = \left\lfloor \frac{1200t}{kT-1} \right\rfloor $$

segments, where the operator $\lfloor\ \rfloor$ indicates the greatest
integer less than or equal to the argument. The truncation will result in
wasting a number of segments; we allow this if the number of unused segments
is reasonably small. These wasted segments become a dead area on disk and
will never be used for storage. The dead area will, of course, contribute to
overall latency.

Of the b segments in a disk block, we may use b-2 segments, with
2 segments at the end or beginning of the block for head switching. If b
does not divide 1200, then some disk blocks may straddle track connections
and we will, in general, have to allow 2-segment spaces at each of the
t-1 connections of the disk tracks. If the dead area is not large enough
to cover this, one might reduce b or try another combination of k and t.
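The choice of b described above can be illustrated with a short Python sketch. It assumes an integer T; the function names and the way the track-connection condition is tested are illustrative assumptions and are not taken from the report's Algol program.

```python
SEGMENTS_PER_TRACK = 1200  # addressable segments per disk track, as stated in the text


def disk_block_segments(k, T, t):
    """Return (b, dead): the largest b with b*(k*T - 1) <= 1200*t and the
    number of segments left unused on the logical track (integer T assumed)."""
    blocks_per_logical_track = k * T - 1
    total_segments = SEGMENTS_PER_TRACK * t
    b = total_segments // blocks_per_logical_track
    dead = total_segments - b * blocks_per_logical_track
    return b, dead


def connections_covered(b, dead, t):
    """True if b divides the track length, or if the dead area can absorb a
    2-segment head-switching gap at each of the t-1 track connections."""
    return SEGMENTS_PER_TRACK % b == 0 or dead >= 2 * (t - 1)


b, dead = disk_block_segments(k=6, T=5, t=1)
print(b, dead, b - 2)  # block size, dead area, segments usable for data
```

For k = 6, T = 5, t = 1 the logical track holds 29 disk blocks, so b = 41 with 11 dead segments, leaving 39 segments per block for data.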

In an actual mesh there will be a number of variables associated
with each mesh point. Let this number be $N_v$. We must have

$$ N_v\,pq \le 4(b-2). $$

In addition, if the mesh has dimensions M X N 8x8 squares,

$$ M \le pm \quad \text{and} \quad N \le qn $$

where m, n are dimensions in mesh blocks.

The $\ell_m$, $\ell_n$ satisfy

$$ (\ell_m-1)k-2 < m \le \ell_m k-2 $$

$$ (\ell_n-1)k-2 < n \le \ell_n k-2 $$

and they should be acceptable as defined by Table 2.
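The relations above determine m, n and the corresponding $\ell$ values once p, q and k are chosen. The following Python sketch shows one way to evaluate them; the helper names are hypothetical, and the acceptability test against Table 2 is not included because the table is not reproduced here.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def mesh_blocks(extent, side):
    """Smallest number of blocks of `side` squares covering `extent` squares,
    i.e. the smallest m with M <= p*m (or n with N <= q*n)."""
    return ceil_div(extent, side)


def ell_for(blocks, k):
    """The integer ell satisfying (ell-1)*k - 2 < blocks <= ell*k - 2."""
    ell = ceil_div(blocks + 2, k)
    assert (ell - 1) * k - 2 < blocks <= ell * k - 2
    return ell


def block_fits(n_v, p, q, b):
    """Check N_v*p*q <= 4*(b-2): a mesh block must fit in the b-2 usable
    segments of a disk block, at 4 quadrant words per segment."""
    return n_v * p * q <= 4 * (b - 2)


m = mesh_blocks(60, 4)
print(m, ell_for(m, 6), block_fits(3, 4, 13, 41))
```

With M = 60 and p = 4 the mesh is 15 blocks high, and for k = 6 this falls in the range 10 < m <= 16, so the corresponding $\ell$ value is 3.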

Noting that there are k edges per edge group, one edge must be
contained in

$$ \left\lfloor \frac{b-2}{k} \right\rfloor $$

segments for the reason that individual edges must be addressable for
transposed reading. An edge consists of $N_v d(8p)$ or $N_v d(8q)$ points.
Hence

$$ 8N_v d\,\mathrm{Max}(p,q) \le 256 \left\lfloor \frac{b-2}{k} \right\rfloor. $$

The number of segments required for an edge is

$$ \left\lfloor \frac{8N_v d\,\mathrm{Max}(p,q) + 255}{256} \right\rfloor. $$

If it is not necessary to change the direction of sweeping over
the mesh, then edges need not be individually addressable, and the
requirement is less stringent:

$$ 8N_v d\,s\,k \le 256(b-2) $$

where s is p or q depending on the direction of sweeping.
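The two edge-storage requirements can be compared numerically. The Python sketch below assumes that each edge must fit in the integer part of (b-2)/k segments for transposed reading, which is one reading of the partly illegible printed bound; the function names are illustrative only.

```python
WORDS_PER_SEGMENT = 256


def edge_segments(n_v, d, p, q):
    """Whole segments needed for one edge of 8*N_v*d*max(p, q) words."""
    words = 8 * n_v * d * max(p, q)
    return (words + WORDS_PER_SEGMENT - 1) // WORDS_PER_SEGMENT


def transposed_ok(n_v, d, p, q, k, b):
    """Each of the k edges in a group must fit in floor((b-2)/k) segments
    so that it is individually addressable (assumed form of the bound)."""
    return edge_segments(n_v, d, p, q) <= (b - 2) // k


def normal_ok(n_v, d, s, k, b):
    """Without transposed reading only the whole edge group must fit:
    8*N_v*d*s*k <= 256*(b-2), with s = p or q."""
    return 8 * n_v * d * s * k <= WORDS_PER_SEGMENT * (b - 2)


print(edge_segments(3, 3, 4, 13), transposed_ok(3, 3, 4, 13, 6, 41))
print(transposed_ok(5, 3, 4, 13, 6, 41), normal_ok(5, 3, 13, 6, 41))
```

The second line of the example shows a case that fails the transposed-reading bound but still satisfies the less stringent bound for sweeping in one direction only.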

In the description of transposed reading, it was assumed that
when two edges from different edge groups were required within the space
of one disk block, the two edges would not be in the same radial position.
In fact, the two edges must be separated by at least two segments. For
$k \ge 6$ this can be guaranteed as follows.

Observe from Figure 8 that for sweeping up column (i-1)k+j-2
we must read the (j-1)th edge from a superscript RT edge group and the jth
(modulo k) edge from a superscript LT edge group. Each edge is mapped into
an "edge slot" on a logical track as shown. All LT edge groups begin with
the (k-1)th edge in the first edge slot; and all RT edge groups begin with
the 1st edge in the first edge slot. Then the edges required for
transmission are separated by two edge slots for $k \ge 6$. The smallest
edge slot possible is one segment. In moving up column ik-1, we need the
kth edge from an RT edge group and the 1st edge from an LT edge group, for
which the requirement is still satisfied. If the edge slot is two segments
or more, then $k \ge 4$ is sufficient. Of course, the unused portion of the
b-2 segments allotted might be distributed between edge slots to achieve
separation.

[Figure 8]

No  mention  has  yet  been  made  of  having  T  be  some  value  >  5  other 
than  an  integer.   This  can  certainly  be  done  without  upsetting  the  logic 
of  the  I/O  sequence.  Another  point  concerning  T  should  be  considered, 
however. 

Recall that T is the ratio of the allowed compute time for a mesh
block to the one-way transmission time for a disk block of b segments. T
is the logical period of the scheme, and it is a parameter of great interest
to us. When dealing with a calculation kernel, however, we are more
interested in the ratio of the allowed compute time to the input time for a
mesh block, which is always less than a disk block by 2 segments or more.
Let this ratio, the real period, be denoted by $T_R$. $N_v pq/4$ segments are
used to store a mesh block; therefore

$$ T_R = \frac{4Tb}{N_v\,pq}. $$

Figure 9 presents the relationship graphically.

Both T and $T_R$ are measures of the same quantity, the amount of
time available for computation on a mesh block. They are measures in
different units, however, as indicated in Figure 9. We are more interested
in $T_R$ than in T because kernel times can be given in mesh blocks. By mesh
block units of time, we mean, of course, the transmission time for a mesh
block of $N_v pq/4$ segments. Although a mesh block is actually transmitted
in $\lfloor (N_v pq+3)/4 \rfloor$ segments, we are interested in the time to
transmit that part of the $\lfloor (N_v pq+3)/4 \rfloor$ segments containing
data. It is obvious that T is a lower bound on $T_R$ and that $T_R$ cannot
equal T because of the two segments reserved for head switching.
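The relation between the logical period T and the real period can be checked numerically. The Python sketch below assumes the formula $T_R = 4Tb/(N_v pq)$ given for the real period; the function name is illustrative. Because a mesh block may occupy at most b-2 segments, the result is never smaller than Tb/(b-2), which exceeds T.

```python
def real_period(T, b, n_v, p, q):
    """Real period T_R = 4*T*b / (N_v*p*q): allowed compute time measured in
    mesh-block input times rather than disk-block transmission times."""
    return 4 * T * b / (n_v * p * q)


# A mesh block that exactly fills the 39 usable segments of a 41-segment block.
t_r_full = real_period(T=5, b=41, n_v=3, p=4, q=13)
# A smaller mesh block in the same disk block gives a larger real period.
t_r_half = real_period(T=5, b=41, n_v=3, p=4, q=6)
print(round(t_r_full, 3), round(t_r_half, 3))
```

Shrinking the mesh block while keeping the same disk block raises the real period, which is why the choice of q matters when matching a scheme to a kernel.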

With every kernel there can be associated a number $T_K$ which is
the ratio of the calculation time per point to the input time per point.
$T_K$ is the calculation time per point normalized to the disk transmission
rate; it is the kernel time in mesh blocks. It is assumed that $T_K$ is
independent of the size of the mesh block which the kernel is updating;
although if the compute time varies slightly because of I/O interrupts,
etc., $T_K$ should be an upper bound on the compute time. We must have
$T_K \le T_R$; and we wish to have $T_R$ as close to $T_K$ as possible in
order to match the overall transmission rate of the scheme to the speed of
the kernel. If several different kernels are applied to the mesh, then we
match the scheme parameter T to the slowest kernel, since it appears that
the parameters of the scheme cannot be changed during a sweep.

Given a mesh of dimensions M X N 8x8 squares with $N_v$ variables
per point, a stencil depth d, and a normalized kernel time $T_K$, the problem
is to find scheme parameters for which the latency of the disk and the
value $T_R - T_K$ are as small as possible. In reality, $T_R - T_K$ is part of
the overall latency since for $T_R > T_K$ the data on disk is "not there"
exactly when we are ready to compute on it. The value $T_R - T_K$ can be
thought of as the ratio to Tb segments, of the number of segments for
which the computer is waiting for the disk to send another mesh block.
In spite of this, the present discussion will distinguish between overall
latency and the value $T_R - T_K$.

4.2  Measuring  Latency

The measure $T_R - T_K$ represents latency which, in a sense, can be
made to go away by increasing $T_K$. It is not proposed to increase the
complexity of a given kernel for no reason; but rather if a solution of the
scheme parameter relations exists, then one might look for a problem with
$T_K$ close to the $T_R$ for the scheme. $T_R - T_K$ might be called external
latency.

The  question  now  is:   "What  is  internal  latency?"  We  can  include 
the  wasted  time  between  complete  sweeps  of  the  mesh,  the  "hiccups"  between 
sweeps  along  rows  or  columns  of  the  mesh,  latency  due  to  incompletely- 
filled  blocks  on  the  last  row  or  column,  and  the  time  spent  skipping  over 
the  dead  area  on  disk.   The  last  three  elements  will  be  discussed  here. 

The size of the dead area on each logical track is 1200t-b(kT-1)
segments. Since it is passed on each logical revolution, the fraction of
total time spent on the dead area, i.e., the dead area latency, is

$$ L_D = \frac{1200t - b(kT-1)}{1200t}. $$

It is unlikely that any code could do useful work to mask the dead area
latency since the dead area is not distributed across the disk blocks in
a logical track, and furthermore since it moves left by 2T-1 disk blocks
relative to the mesh with each complete sweep.

The "hiccups" between sweeps along rows and columns will cause
latency unless $n = \ell_n k-2$ and $\ell_n = 3T$, and likewise for m. For
$n = \ell_n k-2$ there will be $3T-\ell_n$ disk blocks of latency at the end
of each row. In addition, if $N/q \le n \le \ell_n k-2$, we may not be using
all of the $\ell_n k-2$ mesh blocks allowed. Then $\ell_n k-2 - N/q$ blocks
are effectively wasted. For every block so wasted we incur an additional T
blocks of latency. This occurs at the end of every row, out of disk
revolution through approximately $\ell_n(1200t) + Tb$ segments. The latency
incurred at row connections is therefore

$$ L_R = \frac{\left(3T - \ell_n + T(\ell_n k - 2 - N/q)\right) b}{\ell_n(1200t) + Tb}. $$

Similarly, for column sweeping,

$$ L_C = \frac{\left(3T - \ell_m + T(\ell_m k - 2 - M/p)\right) b}{\ell_m(1200t) + Tb}. $$

Additional latency occurs on the last row for row-sweeping if
M < pm. The dimension m should be the smallest integer such that
$M \le pm$. This latency, like the latency at row connections, might be
masked by useful calculation on the boundaries of the mesh. Extra computing
time can be provided for all four boundaries by strictly enclosing the
M X N mesh in the m X n blocks. It will be assumed here, however, that
updating calculations on boundary points are no more time-consuming than
those on interior points. The storage available in the last row, but not
used, amounts to $N_v N(mp-M)/4$ segments. The latency occurs once in m
rows; its value is approximately

$$ L_{RL} = \frac{\tfrac{1}{4} N_v N(mp-M)\,T_R}{m\left(\ell_n(1200t) + Tb\right)}. $$

Similarly, for column-sweeping,

$$ L_{CL} = \frac{\tfrac{1}{4} N_v M(nq-N)\,T_R}{n\left(\ell_m(1200t) + Tb\right)}. $$

In order to determine overall internal latency, one must take
account of the order of sweeping. For an equal number of row and
column-sweeps,

$$ L = L_D + \tfrac{1}{2}\left(L_R + L_{RL} + L_C + L_{CL}\right) $$

is a reasonably good measure of overall internal latency.
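The latency terms can be combined in a few lines of Python. The sketch below is an illustration only: the printed formulas for the connection and last-line terms are partly illegible, so the forms coded here (in particular the use of N/q or M/p for the number of blocks actually used, and the factor $T_R$ in the last-line term) are assumptions, as are all function and parameter names. Exact fractions are used so that small cases can be checked by hand.

```python
from fractions import Fraction

SEGMENTS_PER_TRACK = 1200


def dead_area_latency(T, t, k, b):
    """L_D: fraction of each logical revolution spent over the dead area."""
    total = SEGMENTS_PER_TRACK * t
    return Fraction(total - b * (k * T - 1), total)


def connection_latency(T, t, k, b, ell, extent, side):
    """L_R or L_C (assumed form): extent/side is N/q or M/p, and ell is the
    matching ell value for the direction of sweep."""
    wasted_blocks = ell * k - 2 - Fraction(extent, side)
    numerator = (3 * T - ell + T * wasted_blocks) * b
    return numerator / (ell * SEGMENTS_PER_TRACK * t + T * b)


def last_line_latency(T, t, b, ell, n_v, other_extent, blocks, side, extent, t_r):
    """L_RL or L_CL (assumed form): cost of the storage left unused in the
    last row or column of mesh blocks, incurred once per `blocks` lines."""
    unused_segments = Fraction(n_v * other_extent * (blocks * side - extent), 4)
    return unused_segments * t_r / (blocks * (ell * SEGMENTS_PER_TRACK * t + T * b))


def internal_latency(l_d, l_r, l_rl, l_c, l_cl):
    """L = L_D + (L_R + L_RL + L_C + L_CL)/2 for equal row and column sweeps."""
    return l_d + (l_r + l_rl + l_c + l_cl) / 2


l_d = dead_area_latency(T=5, t=1, k=6, b=41)
l_r = connection_latency(T=5, t=1, k=6, b=41, ell=3, extent=64, side=4)
print(l_d, l_r)
```

In the example the row uses all 16 blocks allowed for k = 6 and an $\ell$ value of 3, so only the fixed term of the connection latency contributes.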

One  latency  term  has  not  been  included.   It  is  the  latency  spent 
re-initializing  between  complete  sweeps.   There  will  be  no  attempt  here 
to  calculate  it,  although  in  some  problems  it  may  be  important. 

4.3  Storage  Requirements 
4.3.1  Fast  Memory 

If  one  examines  the  sequence  of  I/O  for  sweeping  the  mesh,  he 
can  tally  all  mesh  blocks  and  edges  contained  in  fast  memory  at  every 
instant,  and  determine  the  amount  of  storage  needed  for  the  data  as  a 
function  of  time.   If  the  amount  of  storage  needed  for  program  and  scratch 
area  is  added,  and  the  maximum  over  time  of  the  total  storage  required  is 
determined,  one  can  state  whether  the  scheme  will  work  for  a  memory  of 
given  size.  Alternatively,  one  may  examine  storage  requirements  over  a 
sample  of  problems  and  problem  sizes,  and  attempt  to  estimate  the  amount 
of  fast  memory  required  for  a  particular  installation.

The  maximum  amount  of  fast  memory  required  for  storage  of  the
mesh  data  is  a  function  of  the  mesh  and  scheme  parameters.   It  is  also, 
in  a  somewhat  odd  way,  a  function  of  the  organization  of  the  fast  memory 
itself,  and  of  the  size  of  the  smallest  addressable  segment  on  disk  and 
the  length  of  the  disk  track. 

The  storage  required  for  sweeping  in  transposed  mode  is  greater 
than  for  sweeping  in  normal  mode.   The  difference  is  of  the  order  of  an 
edge  group,  but  it  must  be  remembered  that  edge  groups  are  usually  larger 
if  a  transposed  read  is  required,  since  storage  for  individual  edges  rather 
than  edge  groups  is  rounded  upwards  to  the  nearest  disk  segment. 

Since  there  are  many  numerical  PDE  problems  which  do  not  require 
changing  the  direction  of  sweep,  it  is  worthwhile  to  investigate  the  require- 
ments for  normal  reading  separately  from  transposed  reading.   In  this  report 
only  normal  reading  will  be  analyzed. 

Because  of  the  parallel  structure  of  ILLIAC  IV,  special  problems 
arise  in  the  allocation  of  memory.   One  must  be  clever  in  the  design  of  the 
program  and  in  the  distribution  of  data  across  the  processing  elements.   One 
of  the  constraints  imposed  in  the  analysis,  that  of  considering  8x8  squares 
as  the  smallest  subdivision  of  the  mesh,  resulted  from  taking  account  of  the 
structure  of  the  fast  memory.   This  structure  also  causes  problems  with 
storage  of  edge  values.   If  an  edge  or  an  edge  group  is  packed  tightly  into 
the  smallest  number  of  quadrant  words  that  will  contain  it,  then  it  is  likely 
that  some  of  the  edge  points  will  not  be  located  in  the  proper  processing 
elements.  Additional  code  will  be  needed  to  route  data  to  proper  PE's  when 
the  data  is  needed.   The  space  saved  by  packing  the  edges  or  edge  groups 
may  be  used  up  by  the  additional  code  and  scratch  areas.

Nevertheless, in this analysis we will calculate storage
requirements based on having data packed moderately tightly. For
row-normal reading an edge group consists of $8N_v kqd$ words. The number of
disk segments needed to contain an edge group is

$$ S_S^{EGR} = \left\lfloor \frac{8N_v kqd + 255}{256} \right\rfloor. $$

The number of quadrant words needed as an I/O area for an edge group is

$$ S_{SQ}^{EGR} = 4\,S_S^{EGR}. $$

Likewise, the number of segments needed to contain a mesh block is

$$ S_S^{MB} = \left\lfloor \frac{N_v pq + 3}{4} \right\rfloor $$

and the number of quadrant words needed as an I/O area for the mesh block
is

$$ S_{SQ}^{MB} = 4\,S_S^{MB}. $$

For calculations on block (i,j), an area must be set aside to
store the right edge of block (i,j-1). As calculations on (i,j) sweep the
block, old values from the block may be moved into the edge area so that
when (i,j) is completely updated, the edge area contains the right edge
of (i,j). The number of quadrant words needed for the single edge is

$$ S_Q^{E} = \left\lfloor \frac{8N_v pd + 63}{64} \right\rfloor. $$

We might include an 8 X 8 mesh square of storage as a token
amount for overhead. This amounts to $N_v$ quadrant words. If the I/O
sequence for row-normal reading is examined, it can be seen that we need

$$ W_{mem}^{R} = 3S_{SQ}^{MB} + S_{SQ}^{EGR} + \mathrm{Max}\left(S_{SQ}^{MB}, S_{SQ}^{EGR}\right) + S_Q^{E} + N_v $$

quadrant words of fast memory for storage of the data. The third term takes
account of the case in which the ending of one row moves far into the
beginning of the next. A similar expression $W_{mem}^{C}$ exists for
column-normal reading.

No attempt will be made here to estimate the storage required for
program and scratch areas.


4.3.2  Disk

An easy way to manage disk is to allocate half of the disk to old
mesh and edge groups and half to updated mesh and edge groups. This
procedure takes no advantage, however, of the space on disk which becomes
available for updated mesh blocks as the mesh is swept.

Referring once again to Figure 4, note that, starting from block
(1,1) for example, successive disk blocks are filled with each k blocks
added to the mesh row. The last block in the mesh row is
$(1, \ell_n k-2)$. If $\ell_n \le T$, the mesh row will fit into one logical
track, but if $\ell_n \ge T+1$, we will have to use another logical track for
storage. The same situation occurs for the other mesh rows. One mesh row
would require $\left(\lfloor (\ell_n-1)/T \rfloor + 1\right)$ logical tracks.

It is now proposed that rather than using T out of T disk blocks
for storage, we use only T-1 out of T blocks; so that we use another logical
track for $\ell_n \ge T$, and yet another for $\ell_n \ge 2T$ (the second
logical track is filled completely). In this way we insure that there will
be an empty block immediately before blocks (1,2), (1,3), ..., (1,k). We may
then write updated blocks in these spaces as we sweep the row. Updated
block (1,k-2)' is written in the space before (1,k). (1,k-1)' is then
written over (1,1) since we do not need (1,1) any more. Old blocks are thus
successively overwritten by the updated (k-2)th block following them in the
row. One mesh row then requires $\left(\lfloor \ell_n/T \rfloor + 1\right)$
logical tracks; and m mesh rows require
$m\left(\lfloor \ell_n/T \rfloor + 1\right)$ logical tracks.

Note that this procedure for managing disk also works for
column-sweeping. Because of this we may choose the smallest of two possible
storage requirements:

$$ W^{MB} = \mathrm{Min}\left(m\left(\lfloor \ell_n/T \rfloor + 1\right),\ n\left(\lfloor \ell_m/T \rfloor + 1\right)\right). $$

No such game can be played with edge group storage. However,
we may still choose the minimum requirement for two possible storage
methods. Attention is drawn to the upper edge groups (superscript U) in
Figure 4. One storage method is shown in the schematic. Groups with
successive subscripts are stored in adjacent blocks on a logical track
for $\ell_n \le T$. Group I [subscript illegible] would be stored on logical
track 3 in the first position, although it is not shown since the drawing
has $\ell_n$ = T-1. Groups with successive Roman numerals simply wind around
disk at intervals of T blocks, until the last one, $\overline{\ell_m k-2}$.
For every increment of $\ell_m$, another
$\left(\lfloor (\ell_n-1)/T \rfloor + 1\right)$ logical tracks are added, as
long as $\ell_m \le 2T$. If one increases $\ell_m$ to 2T+1,
$2\left(\lfloor (\ell_n-1)/T \rfloor + 1\right)$ logical tracks must be
added; however, we will consider only $\ell_m, \ell_n \le 10$ and $T \ge 5$.

The second possible storage method is to put
$\bar{R}_1, \bar{R}_2, ..., \bar{R}_k$ on different tracks, and to put
$\mathrm{I}_1, \mathrm{I}_{k+1}, ..., \mathrm{I}_{ik+1}$ in adjacent blocks on
the same logical track for $i \le T-1$.
$\mathrm{I}_1, \mathrm{II}_1, \mathrm{III}_1$, ... are still to be stored on
the same logical track, spaced T blocks apart. The expression for the
number of logical tracks required is the dual of the expression for the
first method if $\ell_m, \ell_n \le 2T$.

The number of logical tracks needed for upper edge groups is

$$ W^{EG} = \mathrm{Min}\left(\ell_m\left(\lfloor (\ell_n-1)/T \rfloor + 1\right),\ \ell_n\left(\lfloor (\ell_m-1)/T \rfloor + 1\right)\right). $$

We allocate four such areas on disk, one each for old upper and lower edge
groups and one each for the new.

The amount of disk storage needed is

$$ W_{disk} = t\left(W^{MB} + 4W^{EG}\right) $$

tracks, where $W^{MB}$ is the number of logical tracks needed for the mesh
blocks and $W^{EG}$ the number needed for one edge group area. Note that
this measure is in disk tracks and not logical tracks.

5.   NUMERICAL  SOLUTIONS

For a given problem M, N, $N_v$, d, $T_K$, a scheme must be found with
a latency which is within an acceptable limit. The difference $T_R - T_K$
should be accounted for in the latency measure; however, this section will
be an informal discussion of the existence of schemes for given values of
M, N, $N_v$, d and will consider only the internal latency of the scheme.

If the values of M, N, $N_v$, T, t, k, $\ell_m$, $\ell_n$ are specified, a
scheme can be determined if all of the relations of the last section are
satisfied. The latency may be calculated and tested for acceptability.
The value $T_R$ may also be calculated; as well as a maximum allowable value
of d. If we are interested only in normal reading in either direction,
then we may leave one of the $\ell$-values unspecified, and impose some
additional restriction or specify some other parameter.

A program has been written for the purpose of investigating the
existence of solutions of the parameters for various problems M, N, $N_v$.
A simple procedure is used: T, t, k, $\ell_m$, $\ell_n$ are iterated, and
valid solutions for which the latency is less than or equal to .12 are
printed. For each T, t, k values for $\ell_m$ and $\ell_n$ are tried
successively in an attempt to find a scheme for column-normal or row-normal
reading respectively. The program is written in Burroughs B5500 Extended
Algol, and it is listed in the appendix to this report.
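The outer loop of such a search can be sketched in Python. This is not the report's Algol program: it is a much reduced illustration that iterates T, t and k over the ranges quoted in this section (T = 5 and 6, t = 1 to 5, k = 4 to 30) and keeps only those combinations whose dead-area latency alone stays within the .12 limit. The remaining relations, the other latency terms and the Table 2 test are omitted, and the function name is hypothetical.

```python
def scan_block_sizes(max_dead_fraction=0.12):
    """Yield (T, t, k, b, dead_fraction) for every combination in the quoted
    ranges whose dead-area latency does not exceed max_dead_fraction."""
    for T in (5, 6):
        for t in range(1, 6):
            for k in range(4, 31):
                total = 1200 * t                    # segments per logical track
                blocks = k * T - 1                  # disk blocks per logical track
                b = total // blocks                 # segments per disk block
                dead_fraction = (total - b * blocks) / total
                if b > 2 and dead_fraction <= max_dead_fraction:
                    yield T, t, k, b, dead_fraction


candidates = list(scan_block_sizes())
print(len(candidates))
print(min(candidates, key=lambda c: c[4]))
```

A full search would go on to derive p, q, m, n and the $\ell$ values for each candidate and reject those that violate any of the relations of Section 4.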

For each mesh size M X N squares, values 3 to 8 in unit steps
were assigned to $N_v$. The free parameters T, t, k were iterated in unit
steps over the ranges 5 and 6, 1 to 5, and 4 to 30 respectively. Values
for $\ell_m$ or $\ell_n$ were chosen from Table 2. The results of particular
interest are the values of $T_R$ and $W_{mem}$ obtained. We would like to see
many solutions, with values of $T_R$ well distributed and with memory
requirements very low.

A survey of the results that have been obtained is presented in Table 4.
All of the solutions have $W_{mem} \le 1500$ and $W_{disk} \le 48$. The
smallest problem listed is core contained for $N_v$ = 3 and 4. The largest
problem listed represents about one-third of disk capacity for $N_v$ = 8.

For each solution a maximum allowable value of d, $d_{max}$, is
calculated. If $d_{max} < 3$, the solution is rejected. Tests have shown that
no more than eight per cent of the solutions obtained by the program are
rejected on this basis. Fast memory requirements are calculated for d = 3.
There are few, if any, finite difference stencils in use for which d is
greater than three; so that it is justifiable to group all solutions with
$d_{max} \ge 3$ into one class. Each of these solutions is valid for
$d \le 3$.

The smallest and largest values of $T_R$, over the entire range of
the parameter $N_v$, are listed. For the large meshes the solutions are
numerous over small ranges of $T_R$; and the trend seems to be that solutions
are fewer for smaller meshes and occur over larger ranges of $T_R$. The
distribution over $T_R$ is discussed below. At this point a comment should be
made concerning the method of finding the solutions.

The program used to obtain the results incorporates an artificial
restriction on the mesh block size which is equivalent to an attempt to
minimize the value of T_R for given scheme parameters. Within the program
the parameters p and b are calculated; and from these a largest value of
q is determined such that the bound N_v pq ≤ 4(b-2) is satisfied. If q were
iterated downward from this largest value, solutions with larger T_R might
be found. The program uses only the largest q; the attempt is repeated
with the variables p and q interchanged. There is another constraint on
q which leads us to expect higher T_R for smaller mesh dimensions. Given
n and N, in order to minimize internal latency, q should be the smallest
value such that qn ≥ N. For smaller mesh dimension N, this bound is more
likely to be lower than the bound mentioned above. In fact, most of the
solutions for the small meshes had m or n equal to one and p or q spanning
an entire dimension of the mesh. In such solutions an entire column or
row of 8 X 8 squares would be read at a time; and the W_mem were
overestimated by the program because the edge group areas allocated would
not be needed.

                                              Percent of Solutions
                                                  With W_mem ≤
               Number of
     M X N     Solutions      T_R Range        400     800    1200

    20 X 20         0
    28 X 32       270       5.21 - 59.43        55     100     100
    30 X 40       15*       5.83 - 44.00        48      99     100
    35 X 35        27       6.00 - 48.00        41     100     100
    47 X 47        89       5.16 - 28.37        35      85     100
    45 X 55       405       5.10 - 24.44        29      79      97
    55 X 65       497       5.09 - 17.09        23      73      94
    60 X 70       400       5.11 - 14.00        36      94     100
    90 X 70       507       5.10 - 11.91        33      88      98
    60 X 110      675       5.08 - 12.27        29      78      96
    90 X 110      567       5.07 - 11.02        33      83      95

Table 4.  Survey of results obtained on 11 mesh sizes.
          Latency ≤ .12, normal reading in either direction.
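
The selection of q described above can be sketched as follows. This is a
minimal illustration in Python, not the program itself: the function name
choose_q and its argument names are hypothetical, and the two bounds are
assumed to be N_v pq ≤ 4(b-2) and qn ≥ N as stated in the text.

```python
def choose_q(N, N_v, p, b):
    """Return (n, q) for mesh dimension N, or None if no q >= 1 fits.

    Assumes the bounds N_v*p*q <= 4*(b-2) and q*n >= N from the text.
    """
    # Largest q allowed by the bound N_v*p*q <= 4*(b-2).
    q_max = (4 * (b - 2)) // (N_v * p)
    if q_max < 1:
        return None
    # Smallest block dimension n for which q_max blocks span N.
    n = -(-N // q_max)  # ceiling division
    # Smallest q such that q*n >= N (never larger than q_max).
    q = -(-N // n)
    return n, q


print(choose_q(N=40, N_v=3, p=2, b=20))
```

Recomputing q after n is fixed mirrors the second constraint: once the
block dimension is known, any q beyond the smallest one covering N would
only add internal latency.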

A rigorous mathematical investigation of the existence of solutions
to the system of relations has not been performed; and the program used does
not find every solution possible in the given T_R ranges. In fact, because
of the peculiar behavior of the remainder terms in the integer divisions,
the program may not even find the solutions with smallest T_R for the
iterated parameters. The results obtained, however, are interesting even
without the assurance that all possible solutions have been found.

The last three columns of Table 4 give the percents of the
solutions found for which fast memory requirements are less than or equal
to 400, 800, and 1200 quadrant words. For the larger meshes, a larger
percentage of solutions have high memory requirements. This does not
necessarily indicate, however, that larger meshes require more memory.
No attempt has been made at this time to examine in detail the fast memory
requirements; but this problem is worthy of further investigation.
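
As an illustration of how such percentage columns can be produced, the
sketch below bins memory requirements into 100-word buckets capped at 2000
words (the bucketing that the Appendix listing appears to use) and reports
the share of solutions at or below each limit. The function name and the
sample values are hypothetical; they are not results from the report.

```python
def percent_within(w_mem_values, limits=(400, 800, 1200)):
    """Percent of solutions whose W_mem is <= each limit.

    Values are assumed to be positive integers. Bucket s (1..20) holds
    requirements in (100*(s-1), 100*s]; anything above 2000 lands in 20.
    """
    hist = [0] * 20
    for w in w_mem_values:
        s = (min(w, 2000) - 1) // 100 + 1
        hist[s - 1] += 1
    total = len(w_mem_values)
    result = {}
    for limit in limits:
        count = sum(hist[: limit // 100])  # buckets 1 .. limit/100
        result[limit] = 100.0 * count / total if total else 0.0
    return result


# Hypothetical W_mem values, for illustration only.
print(percent_within([350, 400, 401, 790, 1250, 2600]))
```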




The relationship between T_R and N_v is illustrated in Figure 10
for four pairs of mesh dimensions. The highest and lowest values of T_R
found are plotted for each value of N_v. In addition, selected solutions
for N_v equal to 3 and 8 are plotted, and the number of solutions found is
printed above the highest point for each N_v. The additional points at
N_v = 3 and N_v = 8 are selected to represent the densities over T_R of the
solutions obtained. It is seen that the density increases with increasing
mesh size. Further tests are yet to be performed to determine whether the
procedure of iterating p or q downward yields values of T_R which would
close the gaps in the higher regions.

Note that the highest points for the two smallest meshes form
straight lines. The six points in each of the graphs represent the same
solution with only the difference that for smaller N_v the value of T_R is
greater and the fast memory requirement is lower. A solution for the
problem M, N, N_v, d, T_R is also valid for the problem
M, N, N_v-1, d, T_R N_v/(N_v-1); but it is not in general valid for the
problem with N_v+1 variables because of the requirement N_v pq ≤ 4(b-2).
The term "solution" here refers to the set of parameters
{T, k, t, l_m, m, l_n, n, p, q, b, L, W_mem} where L is the internal
latency of the scheme. Note that L is independent of N_v and T_R.
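
The effect of dropping or adding a variable can be checked numerically. The
sketch below assumes the relation T_R = 4Tb/(N_v pq), which is how the
assignment to TREAL in the Appendix listing appears to read, together with
the requirement N_v pq ≤ 4(b-2). The function names and the sample
parameter values are hypothetical.

```python
from fractions import Fraction


def real_period(T, b, N_v, p, q):
    """Real period T_R, assuming T_R = 4*T*b / (N_v*p*q)."""
    return Fraction(4 * T * b, N_v * p * q)


def is_feasible(N_v, p, q, b):
    """Check the requirement N_v*p*q <= 4*(b-2)."""
    return N_v * p * q <= 4 * (b - 2)


T, b, p, q = 5, 11, 2, 3          # illustrative values only
for N_v in (5, 6, 7):
    print(N_v, is_feasible(N_v, p, q, b), float(real_period(T, b, N_v, p, q)))

# Going from N_v = 6 to N_v - 1 = 5 scales T_R by N_v/(N_v-1) = 6/5.
ratio = real_period(T, b, 5, p, q) / real_period(T, b, 6, p, q)
print(ratio)
```

With these values the same p, q, and b remain feasible for 5 and 6
variables but not for 7, which is the asymmetry described above.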

A straight line (with the slope indicated) may be drawn to the
left from every solution on the graph; solutions exist along these lines.
One such line is drawn on the first graph to indicate the existence of a
solution at the point marked "x". This solution would also be found using
the procedure of iterating p or q downward. This procedure might also
yield solutions for N_v = 3 which are invalid for higher N_v.
The most interesting region of the graph is the low end of the
T_R scale, since for smaller T_R the disk more closely approximates a fast
memory for the restricted class of problems considered here. The results
in this region are better for the larger meshes than for the smaller meshes
in both the density of solutions and in the minimum T_R obtainable.
Figure 11 shows the minimum values on the same graph.

A fact of moderate interest is that a solution for the problem
M, N, sN_v, d, T_R where s is an integer can be modified to accommodate the
problems sM, N, N_v, d, T_R or M, sN, N_v, d, T_R independently of the
direction of sweeping. The modification consists of multiplying p or q by
s; T_R is unchanged and W_mem decreases. Non-integer values may be
substituted for s if one is careful to ensure that wherever s is used as a
multiplier, the result is an integer.
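
A minimal sketch of this modification follows. It only tracks N_v, p, and q
and checks that the product N_v pq (and hence, under the assumption that
T_R depends on these three only through that product, T_R itself) is
unchanged. The function name and arguments are hypothetical, and the
change in W_mem is not modeled.

```python
from fractions import Fraction


def rescale_solution(N_v, p, q, s, stretch="p"):
    """Trade a factor s between the variable count and one mesh dimension.

    N_v is the variable count of the original problem (s times that of
    the new one). Returns (N_v/s, p*s, q) or (N_v/s, p, q*s). Raises
    ValueError when s does not produce integers everywhere it is used.
    """
    s = Fraction(s)
    new_N_v = Fraction(N_v) / s
    if stretch == "p":
        new_p, new_q = p * s, Fraction(q)
    else:
        new_p, new_q = Fraction(p), q * s
    for value in (new_N_v, new_p, new_q):
        if value.denominator != 1:
            raise ValueError("s must yield integer N_v, p and q")
    return int(new_N_v), int(new_p), int(new_q)


print(rescale_solution(6, 2, 3, 2))                # integer s
print(rescale_solution(6, 2, 3, Fraction(3, 2)))   # non-integer s that works
```

Using exact fractions rather than floating point makes the integrality
check for non-integer s reliable.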

In an automatic method for finding the optimum solution for a
given problem, the external latency should be added to the internal latency,
and the result could be used as a measure of the total latency. The
solution with the smallest total latency would be the optimum solution. The
external latency is given by

     L_external = [numerator illegible] / (l_m (1200t) + Tb)

for column sweeping. An exact expression would be obtained by replacing
Tb in the denominator by T [remainder illegible]. The search for a solution
could continue over a wide range of the parameter T. Real values of
T < T_R or even slightly greater than T_R might be tried. Of course the
interval over which T is iterated should be properly discretized to assure
that disk blocks do not begin within a disk segment.

Figure 11.  Minimum values of T_R found.

A  search  over  very  many  values  of  T  would  be  lengthy,  and  an 
automatic  method  should  incorporate  a  procedure  to  decide  when  to  stop. 
The  first  solution  found  with  an  acceptable  total  latency  might  be  taken, 
for  example. 
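
The stopping rule suggested above can be sketched as a simple loop. The
function name, the two latency callables, and the numbers in the usage
example are placeholders introduced for illustration; the report does not
specify such an interface.

```python
def first_acceptable(candidates, internal_latency, external_latency, limit):
    """Return (T, total_latency) for the first candidate T whose total
    latency does not exceed `limit`, or None if no candidate qualifies."""
    for T in candidates:
        total = internal_latency(T) + external_latency(T)
        if total <= limit:
            return T, total
    return None


# Placeholder latency tables, for illustration only.
internal = {5: 0.10, 6: 0.05, 7: 0.02}
external = {5: 0.05, 6: 0.04, 7: 0.06}
print(first_acceptable([5, 6, 7], internal.get, external.get, limit=0.12))
```

Taking the first acceptable candidate trades optimality for search time; a
full search would instead keep the minimum of the totals over all
candidates.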

The efficiency of disk allocation (i.e., the percentage of
allocated space which is actually used) has not been considered here, but it
should be noted that for T << T_R this efficiency cannot be high. If this
is an important factor, one might start the search with higher values of T.
One of the l_m or l_n would have to be close to a multiple of T for high
storage efficiency.

We  can  expect  that  solutions  would  be  fewer  and  memory  requirements 
higher  for  problems  requiring  changes  in  the  direction  of  sweeping.   Changing 
direction  might  be  useful  in  handling  boundaries  of  the  mesh,  for  example. 
It  is  meaningless,  however,  to  speak  of  direction  changes  for  solutions 
with  p  or  q  spanning  a  dimension  of  the  mesh;  there  is  only  one  direction 
possible.   Since  most  of  the  solutions  for  small  meshes  were  of  this  type, 
we  might  expect  that  changing  direction  could  be  meaningful  only  for  large 
meshes. 

It  seems  that  the  ability  to  change  direction  of  sweep  is  of 
questionable  importance.   In  this  light  the  results  given  would  indicate 
that  the  scheme  presented  in  this  report  can  be  useful  for  non-core- 
contained  problems. 
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6.   CONCLUSION 


The scheme described in this paper is only one of a family of
schemes that could work. Another member of the family is obtained by
interchanging the positions of upper and lower edge groups R. and R.
and shifting all of the mesh blocks in Figure 4 left by T disk blocks.

It might be possible to find a scheme which would work with a
logical period T of four. This might be done by allowing transmissions of
two edge groups within the space of one disk block. It is clear, however,
that no scheme could have a logical period T or a real period T_R less than
two, since every mesh block must be input and output. The value two is
an obvious lower limit on the period.

There are a number of problems in which several different kernels
are to be applied to different subsets of the N_v variables. There is no
provision in the present scheme for handling these types of problems
efficiently; on each sweep of the mesh, all of the variables are transmitted.
If one transmits only the variables needed, then the value of T_R is
effectively increased since the logical period T must remain constant during
problem execution. Remapping of the mesh onto disk might be considered,
but this would not be an easy procedure to use. The two problems of
changing the period and handling overlapping subsets of the N_v variables
are somewhat related and are worthy of further investigation.
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APPENDIX 

The following is a listing of the Burroughs B5500 Extended Algol
program used to obtain the numerical results. A note may be helpful to
the reader who is unfamiliar with the language. The operator "/" results
in a floating point division with a floating point result. The operator
"DIV" indicates a fixed point division with truncation of the remainder.
The expression "A DIV B" is equivalent to $\lfloor A/B \rfloor$.
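
For readers more familiar with current languages, the two operators can be
mimicked as below. This is a sketch in Python, not part of the original
program; the helper names are hypothetical, and truncation toward zero for
negative operands is an assumption (the listing only divides non-negative
quantities, where truncation and the floor agree).

```python
def algol_div(a, b):
    """Integer division that discards the remainder (truncates toward zero)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def ceil_div(a, b):
    """Ceiling of a/b for positive integers, written the way the listing
    does it, e.g. (MM+M-1) DIV M."""
    return algol_div(a + b - 1, b)


print(47 / 5)             # "/" : floating point result
print(algol_div(47, 5))   # "DIV": remainder discarded
print(ceil_div(47, 5))    # smallest count of blocks of size 5 covering 47
```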
BEGIN
COMMENT  A "?" MARKS A CHARACTER THAT IS ILLEGIBLE IN THE SOURCE LISTING;
REAL ARRAY PCT[1:20];
REAL TREAL,RULT,LTCY,CR,DM,DN,LTCYM,LTCYN,TR,TRMIN,TRMAX;
INTEGER ARRAY TBL[5:12,0:7],HIST[1:20],MMM,NNN[1:15];
INTEGER NV,TP,DMAX,K,T,P,Q,NVPQ,B,KTM1,LT,BM2,FBM2,UNFIL,
        N,M,NN,MM,LM,LN,ILM,ILN,S,SUM,MAXPQ,I,DSKM,DSKN,
        EDGM,EDGN,DISK,LMKTM1,LNKTM1,ISZ,TI,D,ENPK,WEGSQ,WMBSQ,WEQ,WMEM;
ALPHA SYM,SYM1;
FILE PTOUT 15(1,1?);
LABEL SKIP,SKIP2,CYC;
LIST LST(NV,TP,LM,M,LN,N,TREAL,SYM,D,K,T,P,Q,NVPQ,B,UNFIL,RULT,
         LTCY,SYM,SYM1,DM,DN,WMEM,DISK);
FORMAT FA(//2I5,"  = MM,NN     CRITERION =",F5.3/
          "NV  TP  LM   M  LN   N  TREAL   D   K   T   P   Q  NVP
          Q   B  UNFIL  RULT  LTCY     DMRL  DNRL  MEMRY  DISK"),
       FR(I2,X2,I2,2(X3,I?,I?),X2,F5.2,X1,A1,X?,I2,X3,I2,X3,I1,I5,I?,
          X3,I4,X?,I?,X1,I3,X?,F4.3,X4,F4.3,X1,2A1,X4,X3,2F5.1,I8,I5),
       FNV("*******  TRMIN="F5.2"  TRMAX="F5.2"  SUM="I5),
       FHSTD("FAST-MEM REQUIREMENT DISTRIBUTION"X5"NO. OF SOLNS ="I6,
          X10"FOR"I?"X"I3//
          "     Q-W"X6"FOUND        PCT     ACCUM"/),
       FHST(I4,I10,F7.1);
FILL TBL[5,*]  WITH 4,1,?,?,10;
FILL TBL[6,*]  WITH 5,1,3,?,7,9;
FILL TBL[7,*]  WITH 5,1,3,?,?,10;
FILL TBL[8,*]  WITH 5,1,3,?,8,?;
FILL TBL[9,*]  WITH 6,1,3,?,6,9,10;
FILL TBL[10,*] WITH 6,1,3,5,6,7,10;
FILL TBL[11,*] WITH ?,1,?,5,?,7,?;
FILL TBL[12,*] WITH ?,1,?,5,?,7,8,9;
FILL MMM[*] WITH 30,20,28,35,45,47,55,60,60,90,90;
FILL NNN[*] WITH 40,20,32,35,55,47,65,70,110,70,110;
CR←.12;
FOR ISZ←1 STEP 1 UNTIL 11 DO
BEGIN
   FOR I←1 STEP 1 UNTIL 20 DO HIST[I]←0;
   WRITE(PTOUT[PAGE]);
   MM←MMM[ISZ];  NN←NNN[ISZ];
   WRITE(PTOUT,FA,MM,NN,CR);
   FOR NV←3 STEP 1 UNTIL 8 DO
   BEGIN
      IF NV×MM×NN>210?00 THEN GO TO SKIP2;
      TRMIN←?99;
      TRMAX←0;
      SUM←0;
      FOR TP←5,6 DO
      FOR T←1 STEP 1 UNTIL 5 DO
      FOR K←? STEP 1 UNTIL 30 DO
      FOR ILM←1 STEP 1 UNTIL TBL[TP,0] DO
      BEGIN
         MM←MMM[ISZ];  NN←NNN[ISZ];
         SYM1←" ";
         KTM1←K×TP-1;
         B←(LT←1200×T) DIV KTM1;
         UNFIL←LT-KTM1×B;
         IF T=1 THEN SYM←" "
         ELSE IF 1200 MOD B = 0 THEN SYM←"E"
         ELSE IF 2×(T-1)≤UNFIL THEN SYM←"A"
         ELSE
         BEGIN
            B←B-1;
            UNFIL←UNFIL+KTM1;
            SYM←"B"
         END;
         BM2←B-2;
         FBM2←4×BM2;
         LM←TBL[TP,ILM];
         M←LM×K-2;
CYC:
         P←(MM+M-1) DIV M;
         Q←FBM2 DIV (NV×P);
         IF Q=0 THEN GO TO SKIP;
         N←(NN+Q-1) DIV Q;
         Q←(NN+N-1) DIV N;
         LN←(K+1+N) DIV K;
         NVPQ←NV×P×Q;
         RULT←UNFIL/LT;
         DM←M-MM/P;
         DN←N-NN/Q;
         TREAL←4×TP×B/NVPQ;
         LTCYM←(((TP×(3+DM)-LM)×B+.?5×NV) ? MM×(Q-NN/N)×TREAL)/
               (LM×LT+TP×B);
         LTCY←LTCYM+RULT;
         IF LTCY>CR THEN GO TO SKIP;
         DMAX←(256×BM2) DIV (ENPK←8×NV×P×K);
         IF DMAX=0 THEN GO TO SKIP;
         D←1;
         WEGSQ←8×((D×ENPK+255) DIV 256);
         WMBSQ←4×((NVPQ+3) DIV 4);
         WEQ←(?×NV×?×D+63) DIV 64;
         WMEM←3×WEGSQ+WMBSQ+WEQ+NV+(IF WEGSQ>WMBSQ THEN WEGSQ
               ELSE WMBSQ);
         IF WMEM>?00 THEN GO TO SKIP;
         S←((IF WMEM≤2000 THEN WMEM ELSE 2000)-1) DIV 100 +1;
         DSKM←(M-ENTIER(DM))×(LN DIV TP +1);
         DSKN←N×(LM DIV TP +1);
         EDGM←LM×((LN-1) DIV TP +1);
         EDGN←LN×((LM-1) DIV TP +1);
         DISK←T×((IF DSKM<DSKN THEN DSKM ELSE DSKN)+
               4×(IF EDGM<EDGN THEN EDGM ELSE EDGN));
         IF DISK>4? THEN GO TO SKIP;
         IF SYM1="N" THEN
         BEGIN
            TI←LM;  LM←LN;  LN←TI;
            TI←M;   M←N;    N←TI;
            TI←P;   P←Q;    Q←TI;
            TR←DM;  DM←DN;  DN←TR
         END;
         WRITE(PTOUT,FR,LST);
         IF TRMIN>TREAL THEN TRMIN←TREAL;
         IF TRMAX<TREAL THEN TRMAX←TREAL;
         SUM←SUM+1;
         HIST[S]←HIST[S]+1;
SKIP:    IF SYM1=" " THEN
         BEGIN
            MM←NNN[ISZ];  NN←MMM[ISZ];
            SYM1←"N";
            GO TO CYC
         END
      END;
      WRITE(PTOUT,FNV,TRMIN,TRMAX,SUM)
   END;
   WRITE(PTOUT[PAGE]);
   FOR I←1 STEP 1 UNTIL 20 DO PCT[I]←SUM←SUM+HIST[I];
   WRITE(PTOUT,FHSTD,SUM,MMM[ISZ],NNN[ISZ]);
   IF SUM>0 THEN
   FOR I←1 STEP 1 UNTIL 20 DO
   BEGIN
      PCT[I]←PCT[I]/SUM;
      WRITE(PTOUT,FHST,100×I,HIST[I],100×PCT[I])
   END;
SKIP2:
END
END.




LIST  OF  REFERENCES 


[1]   Barnes, G. H., et al., "The ILLIAC IV Computer," IEEE Transactions
      on Computers, Vol. C-17, No. 8 (August, 1968), pp. 746-757.


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